State-Level ChatGPT Search Intensity Analysis

Research Question

How is search intensity for AI (specifically ChatGPT) associated with income, education, broadband access, and industry composition at the state level?

This analysis examines correlations between:

- ChatGPT search intensity (Google Trends)
- Median household income
- Education levels (% with bachelor’s degree or higher)
- Internet access (% households with internet)
- Knowledge economy employment (% in knowledge industries)
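As a preview of the method, pairwise Pearson correlations can be computed directly with pandas. A minimal sketch using three of the state rows that appear in the merged data later in this document (DC, California, Texas); the full analysis uses all 51 rows:

```python
import pandas as pd

# Three illustrative rows (DC, California, Texas) from the merged state data
df = pd.DataFrame({
    "Avg_AI_Interest": [100, 98, 88],
    "median_household_income": [101722, 91905, 73035],
    "pct_bachelors_plus": [62.6, 35.9, 32.3],
})
corr = df.corr()  # pairwise Pearson correlation matrix
print(corr.round(3))
```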

1. Setup and Installation

1.1 Install Required Packages

Run this cell first if you get a ModuleNotFoundError.
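The install cell itself is not shown; a minimal sketch is below. The package list is inferred from the imports used in this notebook, and openpyxl is an assumption (pd.read_excel needs it for the .xlsx BLS file):

```python
import importlib

# Packages inferred from the imports used in this notebook; openpyxl is
# assumed to be required by pd.read_excel for the .xlsx BLS file.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn",
            "plotly", "census", "scipy", "openpyxl"]

def find_missing(packages):
    """Return the subset of packages that cannot be imported."""
    missing = []
    for pkg in packages:
        try:
            importlib.import_module(pkg)
        except ImportError:
            missing.append(pkg)
    return missing

missing = find_missing(REQUIRED)
if missing:
    print("Run: pip install " + " ".join(missing))
else:
    print("✓ All required packages are installed")
```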

1.2 Import Libraries

Code
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from pathlib import Path
from census import Census

# Configure Plotly for Quarto/HTML rendering
pio.renderers.default = "plotly_mimetype+notebook"

# Set up paths
PROJECT_ROOT = Path.cwd()
DATA_RAW = PROJECT_ROOT / 'Data' / 'raw'
DATA_PROCESSED = PROJECT_ROOT / 'Data' / 'processed'

# Create directories if they don't exist
DATA_RAW.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

print("✓ Libraries imported")
print(f"✓ Project root: {PROJECT_ROOT}")
print(f"✓ Data raw folder: {DATA_RAW}")
print(f"✓ Data processed folder: {DATA_PROCESSED}")
✓ Libraries imported
✓ Project root: c:\Users\16468\OneDrive - PennO365\Documents\Academics\MUSA\MUSA_5500\Final Project
✓ Data raw folder: c:\Users\16468\OneDrive - PennO365\Documents\Academics\MUSA\MUSA_5500\Final Project\Data\raw
✓ Data processed folder: c:\Users\16468\OneDrive - PennO365\Documents\Academics\MUSA\MUSA_5500\Final Project\Data\processed

3. Census API Setup

3.1 Initialize Census API

Code
# Load Census API key from config file
try:
    from config import CENSUS_API_KEY
    c = Census(CENSUS_API_KEY, year=2022)  # Using 2022 ACS 5-year estimates
    print("Census API initialized")
except ImportError:
    print("⚠ Error: config.py not found")
    print("Please create config.py with your Census API key:")
    print("  CENSUS_API_KEY = 'your_api_key_here'")
    c = None
except Exception as e:
    print(f"⚠ Error initializing Census API: {e}")
    c = None
Census API initialized
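For reference, the config.py that the try/except above expects is a one-line module; the key string here is a placeholder matching the error message, not a real key. Keep this file out of version control (e.g. add it to .gitignore):

```python
# config.py -- placeholder value; replace with your own Census API key
CENSUS_API_KEY = "your_api_key_here"
```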

4. Collect ACS Data

4.1 Broadband Access (Table B28002)

Code
# ACS Table B28002: Internet Subscription by Type
# Using simpler approach: Total with internet subscription (any type)
# Variables:
# B28002_001E: Total households
# B28002_002E: With internet subscription (any type - includes broadband, cellular, satellite, etc.)

if c:
    broadband_data = c.acs5.state(
        ('NAME', 'B28002_001E', 'B28002_002E'),
        Census.ALL
    )
    
    broadband_df = pd.DataFrame(broadband_data)
    
    # Convert to numeric
    numeric_cols = ['B28002_001E', 'B28002_002E']
    for col in numeric_cols:
        broadband_df[col] = pd.to_numeric(broadband_df[col], errors='coerce')
    
    # Calculate percentages
    # pct_with_internet: % of households with any internet subscription (our main measure)
    # pct_no_internet: % of households with no internet subscription (calculated as Total - With Internet)
    broadband_df['pct_with_internet'] = (broadband_df['B28002_002E'] / broadband_df['B28002_001E']) * 100
    broadband_df['pct_no_internet'] = ((broadband_df['B28002_001E'] - broadband_df['B28002_002E']) / broadband_df['B28002_001E']) * 100
    
    # Rename state column
    broadband_df = broadband_df.rename(columns={'NAME': 'State'})
    
    # Remove 'United States' row if present
    broadband_df = broadband_df[broadband_df['State'] != 'United States']
    
    print(f"Broadband data shape: {broadband_df.shape}")
    print("\nFirst 5 states:")
    print(broadband_df[['State', 'B28002_001E', 'B28002_002E', 'pct_with_internet', 'pct_no_internet']].head())
    print("\nVerification (should add to 100%):")
    broadband_df['check_sum'] = broadband_df['pct_with_internet'] + broadband_df['pct_no_internet']
    print(broadband_df[['State', 'pct_with_internet', 'pct_no_internet', 'check_sum']].head())
    print("\nSummary statistics:")
    print(broadband_df[['pct_with_internet', 'pct_no_internet']].describe())
    print("\nNote: pct_with_internet includes all types of internet (broadband, cellular, satellite, etc.)")
    print("      pct_no_internet is calculated as: (Total - With Internet) / Total")
else:
    print("⚠ Census API not initialized. Cannot fetch broadband data.")
    broadband_df = pd.DataFrame()
Broadband data shape: (52, 6)

First 5 states:
        State  B28002_001E  B28002_002E  pct_with_internet  pct_no_internet
0     Alabama    1933150.0    1625807.0          84.101441        15.898559
1      Alaska     264376.0     236875.0          89.597770        10.402230
2     Arizona    2739136.0    2448838.0          89.401841        10.598159
3    Arkansas    1171694.0     968272.0          82.638641        17.361359
4  California   13315822.0   12195945.0          91.589877         8.410123

Verification (should add to 100%):
        State  pct_with_internet  pct_no_internet  check_sum
0     Alabama          84.101441        15.898559      100.0
1      Alaska          89.597770        10.402230      100.0
2     Arizona          89.401841        10.598159      100.0
3    Arkansas          82.638641        17.361359      100.0
4  California          91.589877         8.410123      100.0

Summary statistics:
       pct_with_internet  pct_no_internet
count          52.000000        52.000000
mean           87.693465        12.306535
std             3.369030         3.369030
min            73.186992         7.724270
25%            86.687535        10.018033
50%            88.242682        11.757318
75%            89.981967        13.312465
max            92.275730        26.813008

Note: pct_with_internet includes all types of internet (broadband, cellular, satellite, etc.)
      pct_no_internet is calculated as: (Total - With Internet) / Total

4.2 Education Level (Table B15003)

Code
# ACS Table B15003: Educational Attainment
# Calculate % of adults (25+) with bachelor's degree or higher
# Variables:
# B15003_001E: Total (population 25 years and over)
# B15003_022E through B15003_025E: Bachelor's, Master's, Professional, Doctorate degrees

if c:
    education_data = c.acs5.state(
        ('NAME', 'B15003_001E', 'B15003_022E', 'B15003_023E', 'B15003_024E', 'B15003_025E'),
        Census.ALL
    )
    
    education_df = pd.DataFrame(education_data)
    
    # Convert to numeric
    numeric_cols = ['B15003_001E', 'B15003_022E', 'B15003_023E', 'B15003_024E', 'B15003_025E']
    for col in numeric_cols:
        education_df[col] = pd.to_numeric(education_df[col], errors='coerce')
    
    # Calculate % with bachelor's degree or higher
    # Sum: Bachelor's + Master's + Professional + Doctorate
    education_df['bachelors_plus'] = (
        education_df['B15003_022E'] +  # Bachelor's degree
        education_df['B15003_023E'] +  # Master's degree
        education_df['B15003_024E'] +  # Professional degree
        education_df['B15003_025E']    # Doctorate degree
    )
    
    education_df['pct_bachelors_plus'] = (education_df['bachelors_plus'] / education_df['B15003_001E']) * 100
    
    # Rename state column
    education_df = education_df.rename(columns={'NAME': 'State'})
    
    # Remove 'United States' row if present
    education_df = education_df[education_df['State'] != 'United States']
    
    print(f"Education data shape: {education_df.shape}")
    print("\nFirst 5 states:")
    print(education_df[['State', 'pct_bachelors_plus']].head())
    print("\nSummary statistics:")
    print(education_df['pct_bachelors_plus'].describe())
else:
    print("⚠ Census API not initialized. Cannot fetch education data.")
    education_df = pd.DataFrame()
Education data shape: (52, 9)

First 5 states:
        State  pct_bachelors_plus
0     Alabama           27.208387
1      Alaska           30.748285
2     Arizona           31.798873
3    Arkansas           24.705551
4  California           35.864402

Summary statistics:
count    52.000000
mean     33.767735
std       6.697265
min      22.711161
25%      30.083528
50%      32.898278
75%      36.378206
max      62.636093
Name: pct_bachelors_plus, dtype: float64

4.3 Median Household Income (Table B19013)

Code
# ACS Table B19013: Median Household Income
# Variable: B19013_001E: Median household income in the past 12 months

if c:
    income_data = c.acs5.state(
        ('NAME', 'B19013_001E'),
        Census.ALL
    )
    
    income_df = pd.DataFrame(income_data)
    
    # Convert to numeric
    income_df['B19013_001E'] = pd.to_numeric(income_df['B19013_001E'], errors='coerce')
    
    # Rename columns
    income_df = income_df.rename(columns={
        'NAME': 'State',
        'B19013_001E': 'median_household_income'
    })
    
    # Remove 'United States' row if present
    income_df = income_df[income_df['State'] != 'United States']
    
    print(f"Income data shape: {income_df.shape}")
    print("\nFirst 5 states:")
    print(income_df.head())
    print("\nSummary statistics:")
    print(income_df['median_household_income'].describe())
else:
    print("⚠ Census API not initialized. Cannot fetch income data.")
    income_df = pd.DataFrame()
Income data shape: (52, 3)

First 5 states:
        State  median_household_income state
0     Alabama                  59609.0    01
1      Alaska                  86370.0    02
2     Arizona                  72581.0    04
3    Arkansas                  56335.0    05
4  California                  91905.0    06

Summary statistics:
count        52.000000
mean      73828.057692
std       14182.829754
min       24002.000000
25%       66302.250000
50%       72090.000000
75%       84827.250000
max      101722.000000
Name: median_household_income, dtype: float64

4.4 Industry Composition (BLS Data)

Load the manually downloaded BLS industry data, including its precomputed % Knowledge Economy column.

Code
# Load BLS Industry Data - Using your manually calculated %knowledge economy column
# The file should have a State column and a %knowledge economy column (rightmost column)

try:
    # Try CSV first (if you saved as CSV)
    csv_file = DATA_RAW / 'BLS_Industry_Data.csv'
    excel_file = DATA_RAW / 'BLS_Industry_Data.xlsx'
    
    if csv_file.exists():
        bls_data = pd.read_csv(csv_file)
        print(f"✓ Loaded BLS data from CSV: {csv_file}")
    elif excel_file.exists():
        bls_data = pd.read_excel(excel_file)
        print(f"✓ Loaded BLS data from Excel: {excel_file}")
    else:
        print(f"⚠ BLS file not found. Looking for:")
        print(f"  - {csv_file}")
        print(f"  - {excel_file}")
        bls_data = pd.DataFrame()
    
    if not bls_data.empty:
        print(f"\nFile shape: {bls_data.shape}")
        print(f"Columns: {bls_data.columns.tolist()}")
        print(f"\nFirst few rows:")
        print(bls_data.head())
        
        # Find the State column
        state_col = None
        for col in bls_data.columns:
            if 'state' in str(col).lower() and 'code' not in str(col).lower():
                state_col = col
                break
        
        # Use the rightmost column for %knowledge economy (the one you added)
        knowledge_col = bls_data.columns[-1]
        print(f"\n✓ Using rightmost column for knowledge economy: '{knowledge_col}'")
        
        if state_col:
            # Create final dataset
            industry_final = bls_data[[state_col, knowledge_col]].copy()
            industry_final.columns = ['State', 'pct_knowledge']
            
            # Clean state names if needed
            industry_final['State'] = industry_final['State'].str.strip()
            
            # Convert pct_knowledge to numeric
            industry_final['pct_knowledge'] = pd.to_numeric(industry_final['pct_knowledge'], errors='coerce')
            
            # Remove any rows with missing data
            industry_final = industry_final.dropna()
            
            print(f"\n✓ Industry data processed successfully!")
            print(f"  States: {len(industry_final)}")
            print(f"\nFirst 10 states:")
            print(industry_final.head(10))
            print(f"\nSummary statistics for % Knowledge Economy:")
            print(industry_final['pct_knowledge'].describe().round(2))
        else:
            print(f"\n⚠ Could not find State column.")
            industry_final = pd.DataFrame()
    
except Exception as e:
    print(f"Error loading BLS data: {e}")
    import traceback
    traceback.print_exc()
    industry_final = pd.DataFrame()
✓ Loaded BLS data from Excel: c:\Users\16468\OneDrive - PennO365\Documents\Academics\MUSA\MUSA_5500\Final Project\Data\raw\BLS_Industry_Data.xlsx

File shape: (52, 17)
Columns: ['State', 'Total nonfarm', 'Mining and logging', 'Mining, logging, and construction', 'Construction', 'Manufacturing', 'Trade, transportation, and utilities', 'Information', 'Financial activities', 'Professional and business services', 'Education and health services', 'Leisure and hospitality', 'Other services', 'Government', 'Knowledge Economy ', 'Other', '% Knowledge']

First few rows:
        State  Total nonfarm  Mining and logging  \
0     Alabama         2214.9                 9.2   
1      Alaska          339.0                11.8   
2     Arizona         3253.7                16.2   
3    Arkansas         1385.8                 5.1   
4  California        18011.2                18.8   

   Mining, logging, and construction  Construction  Manufacturing  \
0                              119.7         110.5          287.3   
1                               31.6          19.8           12.5   
2                              242.6         226.4          192.4   
3                               70.6          65.5          164.7   
4                              911.8         893.0         1211.2   

   Trade, transportation, and utilities  Information  Financial activities  \
0                                 411.7         22.9                 102.6   
1                                  68.0          4.0                  10.7   
2                                 615.6         46.4                 241.8   
3                                 273.9         11.7                  70.9   
4                                3071.7        526.6                 783.4   

   Professional and business services  Education and health services  \
0                               262.7                          272.2   
1                                29.6                           53.5   
2                               458.3                          560.3   
3                               161.1                          222.9   
4                              2726.9                         3466.1   

   Leisure and hospitality  Other services Government  Knowledge Economy   \
0                    213.4           100.3      422.1               660.4   
1                     35.8            12.9       80.4                97.8   
2                    363.3           106.6      426.4              1306.8   
3                    132.0            65.2      212.8               466.6   
4                   2012.3           595.8     2705.4              7503.0   

     Other  % Knowledge  
0   1554.5     0.298162  
1    241.2     0.288496  
2   1946.9     0.401635  
3    919.2     0.336701  
4  10508.2     0.416574  

✓ Using rightmost column for knowledge economy: '% Knowledge'

✓ Industry data processed successfully!
  States: 51

First 10 states:
                  State  pct_knowledge
0               Alabama       0.298162
1                Alaska       0.288496
2               Arizona       0.401635
3              Arkansas       0.336701
4            California       0.416574
5              Colorado       0.378873
6           Connecticut       0.435008
7              Delaware       0.422641
8  District of Columbia       0.443389
9               Florida       0.403557

Summary statistics for % Knowledge Economy:
count    51.00
mean      0.37
std       0.05
min       0.22
25%       0.33
50%       0.37
75%       0.40
max       0.49
Name: pct_knowledge, dtype: float64
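For reference, the % Knowledge column in the spreadsheet appears to be Knowledge Economy employment divided by Total nonfarm employment; a sketch reproducing the Alabama and Alaska values shown above (column names follow the file's headers):

```python
import pandas as pd

# Reproduce the '% Knowledge' ratio for the first two rows shown above
bls_sketch = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Knowledge Economy": [660.4, 97.8],   # employment in thousands
    "Total nonfarm": [2214.9, 339.0],
})
bls_sketch["% Knowledge"] = bls_sketch["Knowledge Economy"] / bls_sketch["Total nonfarm"]
print(bls_sketch)
```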

5. Merge All Data

Code
# Merge All Data Together
# Combine Google Trends, Broadband, Education, Income, and Industry data

# Check if merged file already exists
merged_file = DATA_PROCESSED / 'merged_state_data.csv'

if merged_file.exists():
    print("✓ Found existing merged data file. Loading from disk...")
    merged_df = pd.read_csv(merged_file)
    
    # Convert pct_knowledge from decimal (0.4) to percentage (40) if needed
    if 'pct_knowledge' in merged_df.columns:
        # Check if values are in decimal format (max < 1) and convert if needed
        if merged_df['pct_knowledge'].max() < 1:
            print("  Converting pct_knowledge from decimal to percentage format...")
            merged_df['pct_knowledge'] = merged_df['pct_knowledge'] * 100
    
    print(f"✓ Loaded merged data from: {merged_file}")
    print(f"  Shape: {merged_df.shape}")
    print(f"  Columns: {merged_df.columns.tolist()}")
    print(f"  Number of states: {len(merged_df)}")
    print(f"\nFirst few rows:")
    print(merged_df.head())
else:
    print("⚠ Merged file not found. Creating merged dataset from scratch...")
    
    # First, ensure trends_df exists and has a 'State' column - reload if needed
    if 'trends_df' not in globals() or trends_df.empty or 'State' not in trends_df.columns:
        print("Reloading Google Trends data...")
        trends_file = DATA_RAW / 'chatgpt_googletrends.csv'
        if trends_file.exists():
            trends_raw = pd.read_csv(trends_file)
            # Use first row as column names
            trends_raw.columns = trends_raw.iloc[0]
            trends_raw = trends_raw.drop(trends_raw.index[0]).reset_index(drop=True)
            
            # Find columns
            state_col = None
            interest_col = None
            for col in trends_raw.columns:
                if 'region' in str(col).lower():
                    state_col = col
                if 'chatgpt' in str(col).lower():
                    interest_col = col
            
            if state_col and interest_col:
                trends_df = trends_raw[[state_col, interest_col]].copy()
                trends_df.columns = ['State', 'Avg_AI_Interest']
                trends_df = trends_df.dropna()
                print(f"✓ Reloaded trends_df. Shape: {trends_df.shape}")
            else:
                print("ERROR: Could not find columns in CSV")
        else:
            print("ERROR: CSV file not found")

    print(f"\ntrends_df columns: {trends_df.columns.tolist()}")
    print(f"trends_df shape: {trends_df.shape}")

    # Now merge
    merged_df = trends_df.copy()

    # Merge broadband data
    merged_df = merged_df.merge(
        broadband_df[['State', 'pct_with_internet', 'pct_no_internet']],
        on='State',
        how='inner'
    )

    # Merge education data
    merged_df = merged_df.merge(
        education_df[['State', 'pct_bachelors_plus']],
        on='State',
        how='inner'
    )

    # Merge income data
    merged_df = merged_df.merge(
        income_df[['State', 'median_household_income']],
        on='State',
        how='inner'
    )

    # Merge industry/knowledge economy data (if available)
    if 'industry_final' in locals() and len(industry_final) > 0:
        merged_df = merged_df.merge(
            industry_final[['State', 'pct_knowledge']],
            on='State',
            how='inner'
        )
        # Convert pct_knowledge from decimal (0.4) to percentage (40) for consistency
        # Check if values are in decimal format (max < 1) and convert if needed
        if merged_df['pct_knowledge'].max() < 1:
            print("  Converting pct_knowledge from decimal to percentage format...")
            merged_df['pct_knowledge'] = merged_df['pct_knowledge'] * 100
        print("✓ Knowledge economy data merged")
    else:
        print("⚠ Knowledge economy data not available - skipping merge")
        merged_df['pct_knowledge'] = np.nan

    print(f"\nMerged dataset shape: {merged_df.shape}")
    print(f"Number of states: {len(merged_df)}")
    print(f"\nColumn names:")
    print(merged_df.columns.tolist())
    print(f"\nFirst few rows:")
    print(merged_df.head())

    # Save merged dataset
    merged_df.to_csv(DATA_PROCESSED / 'merged_state_data.csv', index=False)
    print(f"\n✓ Merged data saved to: {DATA_PROCESSED / 'merged_state_data.csv'}")
✓ Found existing merged data file. Loading from disk...
  Converting pct_knowledge from decimal to percentage format...
✓ Loaded merged data from: c:\Users\16468\OneDrive - PennO365\Documents\Academics\MUSA\MUSA_5500\Final Project\Data\processed\merged_state_data.csv
  Shape: (51, 7)
  Columns: ['State', 'Avg_AI_Interest', 'pct_with_internet', 'pct_no_internet', 'pct_bachelors_plus', 'median_household_income', 'pct_knowledge']
  Number of states: 51

First few rows:
                  State  Avg_AI_Interest  pct_with_internet  pct_no_internet  \
0  District of Columbia              100          89.118863        10.881137   
1            California               98          91.589877         8.410123   
2               Georgia               92          87.889289        12.110711   
3            New Jersey               90          90.722078         9.277922   
4                 Texas               88          88.519366        11.480634   

   pct_bachelors_plus  median_household_income  pct_knowledge  
0           62.636093                 101722.0      44.338876  
1           35.864402                  91905.0      41.657413  
2           33.633675                  71355.0      37.749020  
3           42.252478                  97126.0      42.964549  
4           32.265897                  73035.0      36.787778  
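The inner joins above silently drop any state present in one source but not the other (the ACS pulls return 52 rows while the merged data has 51, so one area, likely Puerto Rico, is dropped). A pre-merge check with pandas' `indicator` flag can surface such rows; a sketch with toy stand-in values, not the real data:

```python
import pandas as pd

# Toy frames standing in for the ACS (52 rows incl. Puerto Rico) and Trends (51 rows) data
acs = pd.DataFrame({"State": ["Alabama", "Puerto Rico"], "pct_with_internet": [84.1, 70.0]})
trends = pd.DataFrame({"State": ["Alabama"], "Avg_AI_Interest": [41]})

check = acs.merge(trends, on="State", how="left", indicator=True)
dropped = check.loc[check["_merge"] == "left_only", "State"].tolist()
print("Rows an inner join would drop:", dropped)
```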
Code
# Add Red/Blue State Dummy Variable
# Based on 2020 Presidential Election results
# 1 = Blue State (Democratic), 0 = Red State (Republican)

# 2020 result for every state: 1 = voted Democratic, 0 = voted Republican
blue_states_2020 = {
    'Alabama': 0, 'Alaska': 0, 'Arizona': 1, 'Arkansas': 0, 'California': 1,
    'Colorado': 1, 'Connecticut': 1, 'Delaware': 1, 'Florida': 0, 'Georgia': 1,
    'Hawaii': 1, 'Idaho': 0, 'Illinois': 1, 'Indiana': 0, 'Iowa': 0,
    'Kansas': 0, 'Kentucky': 0, 'Louisiana': 0, 'Maine': 1, 'Maryland': 1,
    'Massachusetts': 1, 'Michigan': 1, 'Minnesota': 1, 'Mississippi': 0, 'Missouri': 0,
    'Montana': 0, 'Nebraska': 0, 'Nevada': 1, 'New Hampshire': 1, 'New Jersey': 1,
    'New Mexico': 1, 'New York': 1, 'North Carolina': 0, 'North Dakota': 0, 'Ohio': 0,
    'Oklahoma': 0, 'Oregon': 1, 'Pennsylvania': 1, 'Rhode Island': 1, 'South Carolina': 0,
    'South Dakota': 0, 'Tennessee': 0, 'Texas': 0, 'Utah': 0, 'Vermont': 1,
    'Virginia': 1, 'Washington': 1, 'West Virginia': 0, 'Wisconsin': 1, 'Wyoming': 0,
    'District of Columbia': 1
}

# Add blue_state dummy variable (1 = Blue/Democratic, 0 = Red/Republican)
merged_df['blue_state'] = merged_df['State'].map(blue_states_2020)

# Check for any missing values
if merged_df['blue_state'].isna().any():
    missing_states = merged_df[merged_df['blue_state'].isna()]['State'].tolist()
    print(f"⚠ Warning: Missing political classification for: {missing_states}")
    print("  These states will be excluded from regression analysis")
else:
    print("✓ Red/Blue state dummy variable added successfully")

# Display summary
print(f"\nPolitical Classification Summary:")
print(f"  Blue States (Democratic): {merged_df['blue_state'].sum()} states")
print(f"  Red States (Republican): {(merged_df['blue_state'] == 0).sum()} states")
print(f"\nSample of data:")
print(merged_df[['State', 'blue_state']].head(10))
✓ Red/Blue state dummy variable added successfully

Political Classification Summary:
  Blue States (Democratic): 26 states
  Red States (Republican): 25 states

Sample of data:
                  State  blue_state
0  District of Columbia           1
1            California           1
2               Georgia           1
3            New Jersey           1
4                 Texas           0
5               Arizona           1
6              Maryland           1
7              Virginia           1
8               Florida           0
9         Massachusetts           1
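Once the dummy is in place, a first look at the political split is a simple group mean; a sketch using only the five sample rows printed above (the full analysis would use all 51 rows):

```python
import pandas as pd

# The five states shown in the sample output above
sample = pd.DataFrame({
    "State": ["District of Columbia", "California", "Georgia", "New Jersey", "Texas"],
    "Avg_AI_Interest": [100, 98, 92, 90, 88],
    "blue_state": [1, 1, 1, 1, 0],
})
means = sample.groupby("blue_state")["Avg_AI_Interest"].mean()
print(means)
```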

6. Scatterplots

Code
# Scatterplots: Each Variable vs ChatGPT Search Intensity
# Create individual scatterplots for all predictor variables

# Variables to plot against ChatGPT search intensity
variables = {
    'median_household_income': 'Median Household Income ($)',
    'pct_bachelors_plus': '% Adults with Bachelor\'s Degree or Higher',
    'pct_with_internet': '% Households with Internet Access',
    'pct_knowledge': '% Knowledge Economy Employment'
}

# Filter out variables that don't exist in the dataset
available_vars = {k: v for k, v in variables.items() if k in merged_df.columns}

# Create subplots - 2x2 grid
n_vars = len(available_vars)
n_cols = 2
n_rows = (n_vars + 1) // 2

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows))
axes = np.atleast_1d(axes).flatten()  # always a flat array, even for a single Axes

for idx, (var, label) in enumerate(available_vars.items()):
    ax = axes[idx]
    
    # Remove rows with missing data for this variable
    plot_data = merged_df[[var, 'Avg_AI_Interest']].dropna()
    
    if len(plot_data) > 0:
        # Scatter plot
        ax.scatter(plot_data[var], plot_data['Avg_AI_Interest'], 
                   alpha=0.6, s=100, color='steelblue', edgecolors='black', linewidth=0.5)
        
        # Add regression line
        from scipy import stats
        slope, intercept, r_value, p_value, std_err = stats.linregress(plot_data[var], plot_data['Avg_AI_Interest'])
        line_x = np.linspace(plot_data[var].min(), plot_data[var].max(), 100)
        line_y = intercept + slope * line_x
        ax.plot(line_x, line_y, 'r--', linewidth=2, 
                label=f'R² = {r_value**2:.3f}, p = {p_value:.4f}')
        
        ax.set_xlabel(label, fontsize=12, fontweight='bold')
        ax.set_ylabel('ChatGPT Search Intensity', fontsize=12, fontweight='bold')
        ax.set_title(f'{label}\nvs ChatGPT Search Intensity', fontsize=13, fontweight='bold', pad=10)
        ax.legend(fontsize=10)
        ax.grid(True, alpha=0.3)
    else:
        ax.text(0.5, 0.5, f'No data for {label}', 
                ha='center', va='center', transform=ax.transAxes, fontsize=12)
        ax.set_title(label, fontsize=13, fontweight='bold')

# Hide extra subplots if any
for idx in range(len(available_vars), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

print(f"\n✓ Created scatterplots for {len(available_vars)} variables")


✓ Created scatterplots for 4 variables

7. Map Visualizations

Code
# Interactive Map: Switch Between All Variables
# This interactive map allows you to explore all variables using a dropdown menu

print("=" * 80)
print("INTERACTIVE MAP: Explore All Variables")
print("=" * 80)
print("\nUse the dropdown menu below to switch between different variables.")
print("Hover over states to see detailed information.\n")

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Add state abbreviations if not already present
state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
    'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
    'District of Columbia': 'DC'
}

if 'state_code' not in merged_df.columns:
    merged_df['state_code'] = merged_df['State'].map(state_abbrev)

# Prepare data for interactive map
interactive_df = merged_df.copy()

# Create a list of all available variables with their display names and color scales
variables_config = [
    {
        'value': 'Avg_AI_Interest',
        'label': 'ChatGPT Search Intensity',
        'colorscale': 'Viridis',
        'colorbar_title': 'Search Intensity',
        'hover_template': '<b>%{customdata[0]}</b><br>Search Intensity: %{z}<extra></extra>'
    },
    {
        'value': 'median_household_income',
        'label': 'Median Household Income',
        'colorscale': 'Blues',
        'colorbar_title': 'Income ($)',
        'hover_template': '<b>%{customdata[0]}</b><br>Median Income: $%{z:,.0f}<extra></extra>'
    },
    {
        'value': 'pct_bachelors_plus',
        'label': '% Bachelor\'s Degree or Higher',
        'colorscale': 'Greens',
        'colorbar_title': '% Bachelor\'s+',
        'hover_template': '<b>%{customdata[0]}</b><br>% Bachelor\'s+: %{z:.1f}%<extra></extra>'
    },
    {
        'value': 'pct_with_internet',
        'label': '% Households with Internet Access',
        'colorscale': 'Oranges',
        'colorbar_title': '% With Internet',
        'hover_template': '<b>%{customdata[0]}</b><br>% With Internet: %{z:.1f}%<extra></extra>'
    },
    {
        'value': 'pct_knowledge',
        'label': '% Knowledge Economy Employment',
        'colorscale': 'Purples',
        'colorbar_title': '% Knowledge Economy',
        'hover_template': '<b>%{customdata[0]}</b><br>% Knowledge Economy: %{z:.1f}%<extra></extra>'
    }
]

# Create initial figure with ChatGPT Search Intensity
fig_interactive = go.Figure()

# Add all traces (one for each variable) but only show the first one initially
for i, var_config in enumerate(variables_config):
    var_name = var_config['value']
    visible = (i == 0)  # Only first variable is visible initially
    
    fig_interactive.add_trace(
        go.Choropleth(
            locations=interactive_df['state_code'],
            z=interactive_df[var_name],
            locationmode='USA-states',
            colorscale=var_config['colorscale'],
            colorbar=dict(
                title=var_config['colorbar_title'],
                thickness=15,
                len=0.5,
                x=1.02
            ),
            hovertemplate=var_config['hover_template'],
            customdata=interactive_df[['State', var_name]],
            visible=visible,
            name=var_config['label']
        )
    )

# Create dropdown menu
dropdown_buttons = []
for i, var_config in enumerate(variables_config):
    # Create visibility list - all False except for the selected variable
    visibility = [False] * len(variables_config)
    visibility[i] = True
    
    dropdown_buttons.append(
        dict(
            label=var_config['label'],
            method='update',
            args=[
                {'visible': visibility},
                # Each Choropleth trace carries its own colorbar title, so only
                # the figure title needs updating when visibility switches
                {'title': f'{var_config["label"]} by State'}
            ]
        )
    )

# Update layout with dropdown
fig_interactive.update_layout(
    title={
        'text': 'Interactive Map: Select Variable to View<br><sub>Use the dropdown menu to switch between variables</sub>',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 18}
    },
    geo=dict(
        scope='usa',
        projection=go.layout.geo.Projection(type='albers usa'),
        showlakes=True,
        lakecolor='rgb(255, 255, 255)'
    ),
    height=600,
    updatemenus=[
        dict(
            buttons=dropdown_buttons,
            direction='down',
            showactive=True,
            x=0.02,
            xanchor='left',
            y=1.02,
            yanchor='top',
            bgcolor='rgba(255, 255, 255, 0.8)',
            bordercolor='rgba(0, 0, 0, 0.2)',
            borderwidth=1,
            font=dict(size=12)
        )
    ],
    margin=dict(l=0, r=0, t=100, b=0)
)

fig_interactive.show()

print("\n✓ Interactive map created")
print("  - Use the dropdown menu in the top-left to switch between variables")
print("  - Hover over states to see detailed information")
print("=" * 80)
================================================================================
INTERACTIVE MAP: Explore All Variables
================================================================================

Use the dropdown menu below to switch between different variables.
Hover over states to see detailed information.

✓ Interactive map created
  - Use the dropdown menu in the top-left to switch between variables
  - Hover over states to see detailed information
================================================================================
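The dropdown above works by pairing each button with a per-trace visibility mask: button *i* shows only trace *i* and hides the rest. The mask logic can be sketched without Plotly (the function name here is illustrative):

```python
# Build one boolean visibility mask per dropdown button. This mirrors the
# loop used above: for n traces, mask i is all False except position i.
def visibility_masks(n_traces):
    masks = []
    for i in range(n_traces):
        mask = [False] * n_traces
        mask[i] = True
        masks.append(mask)
    return masks

# Each mask would be passed as the 'visible' entry of a button's args
print(visibility_masks(3))
```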
Code
# Map Visualizations: Geographic Patterns
# First: ChatGPT Search Intensity
# Then: one standalone map for each predictor variable, plus a red/blue state map

# Install plotly if needed
import sys
import subprocess
try:
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
except ImportError:
    print("Installing plotly...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "plotly"])
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
    print("Plotly installed successfully!")

# Add state abbreviations for mapping
state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
    'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
    'District of Columbia': 'DC'
}

# Add state codes to merged_df if not already present
if 'state_code' not in merged_df.columns:
    merged_df['state_code'] = merged_df['State'].map(state_abbrev)

# 1. TOP MAP: ChatGPT Search Intensity
print("=" * 80)
print("MAP 1: ChatGPT Search Intensity")
print("=" * 80)

fig_intensity = px.choropleth(
    merged_df,
    locations='state_code',
    locationmode='USA-states',
    color='Avg_AI_Interest',
    scope='usa',
    color_continuous_scale='Viridis',
    title='ChatGPT Search Intensity by State',
    labels={'Avg_AI_Interest': 'Search Intensity'},
    hover_data=['State', 'Avg_AI_Interest']
)
fig_intensity.update_layout(
    height=500,
    title_x=0.5,
    title_font_size=16,
    font=dict(size=12)
)
fig_intensity.update_coloraxes(colorbar=dict(
    thickness=8,
    len=0.5,
    x=1.01
))
fig_intensity.show()

print("\n✓ ChatGPT Search Intensity map created")
print("\n" + "=" * 80)
print("MAP 2: Median Household Income")
print("=" * 80)

# 2. Median Household Income
fig_income = px.choropleth(
    merged_df,
    locations='state_code',
    locationmode='USA-states',
    color='median_household_income',
    scope='usa',
    color_continuous_scale='Blues',
    title='Median Household Income by State',
    labels={'median_household_income': 'Income ($)'},
    hover_data=['State', 'median_household_income']
)
fig_income.update_layout(
    height=500,
    title_x=0.5,
    title_font_size=16,
    font=dict(size=12)
)
fig_income.update_coloraxes(colorbar=dict(
    thickness=8,
    len=0.5,
    x=1.01
))
fig_income.show()

print("\n✓ Median Household Income map created")
print("\n" + "=" * 80)
print("MAP 3: % Bachelor's Degree or Higher")
print("=" * 80)

# 3. Education % Bachelor's+
fig_education = px.choropleth(
    merged_df,
    locations='state_code',
    locationmode='USA-states',
    color='pct_bachelors_plus',
    scope='usa',
    color_continuous_scale='Greens',
    title='% Adults with Bachelor\'s Degree or Higher by State',
    labels={'pct_bachelors_plus': '% Bachelor\'s+'},
    hover_data=['State', 'pct_bachelors_plus']
)
fig_education.update_layout(
    height=500,
    title_x=0.5,
    title_font_size=16,
    font=dict(size=12)
)
fig_education.update_coloraxes(colorbar=dict(
    thickness=8,
    len=0.5,
    x=1.01
))
fig_education.show()

print("\n✓ Education map created")
print("\n" + "=" * 80)
print("MAP 4: % Households with Internet Access")
print("=" * 80)

# 4. Internet Access
fig_internet = px.choropleth(
    merged_df,
    locations='state_code',
    locationmode='USA-states',
    color='pct_with_internet',
    scope='usa',
    color_continuous_scale='Oranges',
    title='% Households with Internet Access by State',
    labels={'pct_with_internet': '% With Internet'},
    hover_data=['State', 'pct_with_internet']
)
fig_internet.update_layout(
    height=500,
    title_x=0.5,
    title_font_size=16,
    font=dict(size=12)
)
fig_internet.update_coloraxes(colorbar=dict(
    thickness=8,
    len=0.5,
    x=1.01
))
fig_internet.show()

print("\n✓ Internet Access map created")
print("\n" + "=" * 80)
print("MAP 5: % Knowledge Economy Employment")
print("=" * 80)

# 5. Knowledge Economy
fig_knowledge = px.choropleth(
    merged_df,
    locations='state_code',
    locationmode='USA-states',
    color='pct_knowledge',
    scope='usa',
    color_continuous_scale='Purples',
    title='% Knowledge Economy Employment by State',
    labels={'pct_knowledge': '% Knowledge Economy'},
    hover_data=['State', 'pct_knowledge']
)
fig_knowledge.update_layout(
    height=500,
    title_x=0.5,
    title_font_size=16,
    font=dict(size=12)
)
fig_knowledge.update_coloraxes(colorbar=dict(
    thickness=8,
    len=0.5,
    x=1.01
))
fig_knowledge.show()

print("\n✓ Knowledge Economy map created")
print("\n✓ All predictor variable maps created")
print("=" * 80)
print("\n" + "=" * 80)
print("MAP 6: Red/Blue States (2020 Presidential Election)")
print("=" * 80)

# 6. Red/Blue States Map
# Create a discrete color map: 0 = Red (Republican), 1 = Blue (Democratic)
# Convert blue_state to categorical string for proper discrete mapping
merged_df_temp = merged_df.copy()
merged_df_temp['political_affiliation'] = merged_df_temp['blue_state'].map({0: 'Republican', 1: 'Democratic'})

fig_political = px.choropleth(
    merged_df_temp,
    locations='state_code',
    locationmode='USA-states',
    color='political_affiliation',
    scope='usa',
    color_discrete_map={'Republican': '#E81B23', 'Democratic': '#0044C9'},  # Red for Republican, Blue for Democratic
    title='Red States vs Blue States (2020 Presidential Election)',
    labels={'political_affiliation': 'Political Affiliation'},
    hover_data=['State', 'political_affiliation']
)
fig_political.update_layout(
    height=500,
    title_x=0.5,
    title_font_size=16,
    font=dict(size=12),
    showlegend=True
)
# Update the traces for better visualization and remove colorbar
fig_political.update_traces(
    marker_line_width=0.5,
    marker_line_color='white',
    showscale=False  # Remove colorbar
)
fig_political.show()

print("\n✓ Red/Blue States map created")
================================================================================
MAP 1: ChatGPT Search Intensity
================================================================================

✓ ChatGPT Search Intensity map created

================================================================================
MAP 2: Median Household Income
================================================================================

✓ Median Household Income map created

================================================================================
MAP 3: % Bachelor's Degree or Higher
================================================================================

✓ Education map created

================================================================================
MAP 4: % Households with Internet Access
================================================================================

✓ Internet Access map created

================================================================================
MAP 5: % Knowledge Economy Employment
================================================================================

✓ Knowledge Economy map created

✓ All predictor variable maps created
================================================================================

================================================================================
MAP 6: Red/Blue States (2020 Presidential Election)
================================================================================

✓ Red/Blue States map created

8. Regression Analysis

Code
# Multiple Linear Regression: All Variables Together
# Dependent variable: Avg_AI_Interest (ChatGPT search intensity)
# Independent variables: Income, Education, Internet, Knowledge Economy, Blue State

# Install statsmodels if needed
import sys
import subprocess
try:
    import statsmodels.api as sm
except ImportError:
    print("Installing statsmodels...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "statsmodels"])
    import statsmodels.api as sm

# Prepare data - remove rows with any missing values
regression_data = merged_df[['Avg_AI_Interest', 'median_household_income', 
                              'pct_bachelors_plus', 'pct_with_internet', 'pct_knowledge', 'blue_state']].dropna()

print("=" * 80)
print("MULTIPLE LINEAR REGRESSION: ALL VARIABLES")
print("=" * 80)
print(f"\nObservations: {len(regression_data)}")
print(f"\nVariables included in the model:")
print(f"  1. Median Household Income")
print(f"  2. % Bachelor's Degree or Higher (pct_bachelors_plus)")
print(f"  3. % Households with Internet Access (pct_with_internet)")
print(f"  4. % Knowledge Economy Employment (pct_knowledge)")
print(f"  5. Blue State (vs Red State) - Dummy variable (blue_state)\n")
print(f"Dependent Variable: ChatGPT Search Intensity (Avg_AI_Interest)\n")

# Prepare X and y
X = regression_data[['median_household_income', 'pct_bachelors_plus', 
                     'pct_with_internet', 'pct_knowledge', 'blue_state']]
y = regression_data['Avg_AI_Interest']

# Add constant term for intercept
X_with_const = sm.add_constant(X)

# Fit the model
model_all = sm.OLS(y, X_with_const).fit()

# Verify all variables are included
print(f"\n✓ Model fitted successfully!")
print(f"  Number of predictors: {len(X.columns)}")
print(f"  Predictor names: {list(X.columns)}")
print(f"  Total parameters (including intercept): {len(model_all.params)}")
print(f"  Model degrees of freedom: {model_all.df_model}")

# Display model summary
print("\n" + "=" * 80)
print("MODEL SUMMARY")
print("=" * 80)
print("\nFull regression results with all 5 predictor variables:\n")
print(model_all.summary())
print("\n" + "=" * 80)

# Create coefficients table with full details
# Add significance markers
var_names = ['Intercept', 'Median Household Income', '% Bachelor\'s+', 
             '% With Internet', '% Knowledge Economy', 'Blue State (vs Red)']
significance_markers = []
for pval in model_all.pvalues:
    if pval < 0.001:
        significance_markers.append('***')
    elif pval < 0.01:
        significance_markers.append('**')
    elif pval < 0.05:
        significance_markers.append('*')
    else:
        significance_markers.append('')

# Add significance to variable names
var_names_with_sig = [f"{var} {sig}" if sig else var for var, sig in zip(var_names, significance_markers)]

coef_table = pd.DataFrame({
    'Variable': var_names_with_sig,
    'Coefficient': model_all.params.values,
    'Std Error': model_all.bse.values,
    't-value': model_all.tvalues.values,
    'P-value': model_all.pvalues.values,
    'Significant': ['Yes' if p < 0.05 else 'No' for p in model_all.pvalues],
    'Lower CI (95%)': model_all.conf_int()[0].values,
    'Upper CI (95%)': model_all.conf_int()[1].values
})

print("\n" + "=" * 80)
print("COEFFICIENTS TABLE (COMPLETE)")
print("=" * 80)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
print("\n" + coef_table.to_string(index=False))
print("\n" + "=" * 80)

# Model interpretation
print(f"\n" + "=" * 80)
print("MODEL INTERPRETATION")
print("=" * 80)
print(f"\nModel Fit:")
print(f"  • R-squared: {model_all.rsquared:.4f} ({model_all.rsquared:.1%})")
print(f"  • Adjusted R-squared: {model_all.rsquared_adj:.4f} ({model_all.rsquared_adj:.1%})")
print(f"  • F-statistic: {model_all.fvalue:.4f}")
print(f"  • F-statistic p-value: {model_all.f_pvalue:.6f}")
if model_all.f_pvalue < 0.05:
    print("  ✓ Model is statistically significant overall")
else:
    print("  ✗ Model is not statistically significant overall")

print(f"\n" + "-" * 80)
print("COEFFICIENT INTERPRETATION:")
print("-" * 80)

# Create a summary of significant variables
significant_vars = []
non_significant_vars = []

# Reuse var_names defined above for the coefficient labels
for i, (var, coef, pval) in enumerate(zip(var_names, model_all.params, model_all.pvalues)):
    significance = "***" if pval < 0.001 else "**" if pval < 0.01 else "*" if pval < 0.05 else ""
    print(f"\n{var}:")
    print(f"  Coefficient: {coef:.6f}")
    print(f"  P-value: {pval:.6f} {significance}")
    if i > 0:  # Skip intercept
        if pval < 0.05:
            significant_vars.append((var, coef, pval, significance))
            if var == 'Blue State (vs Red)':
                direction = "higher" if coef > 0 else "lower"
                print(f"  Interpretation: ✓ Statistically significant - Blue states have {direction} ChatGPT search intensity than red states (by {abs(coef):.2f} points)")
            else:
                direction = "positively" if coef > 0 else "negatively"
                print(f"  Interpretation: ✓ Statistically significant - {direction} associated with ChatGPT search intensity")
        else:
            non_significant_vars.append((var, coef, pval))
            print(f"  Interpretation: ✗ Not statistically significant (p = {pval:.4f})")

# Summary of significant variables
print("\n" + "=" * 80)
print("SUMMARY: SIGNIFICANT PREDICTORS")
print("=" * 80)
if significant_vars:
    print(f"\n{len(significant_vars)} variable(s) are statistically significant:")
    for var, coef, pval, sig in significant_vars:
        print(f"  • {var}: p = {pval:.4f} {sig}")
else:
    print("\n⚠ No individual predictors are statistically significant at p < 0.05")

if non_significant_vars:
    print(f"\n{len(non_significant_vars)} variable(s) are NOT statistically significant:")
    for var, coef, pval in non_significant_vars:
        print(f"  • {var}: p = {pval:.4f}")

print("\n" + "-" * 80)
print("NOTE ON MULTICOLLINEARITY:")
print("-" * 80)
print("When variables are highly correlated (multicollinearity), individual predictors")
print("may not be significant even though the model overall is significant.")
print("This is because the variables share predictive power, making it difficult to")
print("separate their individual effects. Check VIF values to assess multicollinearity.")
print("=" * 80)
================================================================================
MULTIPLE LINEAR REGRESSION: ALL VARIABLES
================================================================================

Observations: 51

Variables included in the model:
  1. Median Household Income
  2. % Bachelor's Degree or Higher (pct_bachelors_plus)
  3. % Households with Internet Access (pct_with_internet)
  4. % Knowledge Economy Employment (pct_knowledge)
  5. Blue State (vs Red State) - Dummy variable (blue_state)

Dependent Variable: ChatGPT Search Intensity (Avg_AI_Interest)


✓ Model fitted successfully!
  Number of predictors: 5
  Predictor names: ['median_household_income', 'pct_bachelors_plus', 'pct_with_internet', 'pct_knowledge', 'blue_state']
  Total parameters (including intercept): 6
  Model degrees of freedom: 5.0

================================================================================
MODEL SUMMARY
================================================================================

Full regression results with all 5 predictor variables:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Avg_AI_Interest   R-squared:                       0.387
Model:                            OLS   Adj. R-squared:                  0.319
Method:                 Least Squares   F-statistic:                     5.683
Date:                Sat, 06 Dec 2025   Prob (F-statistic):           0.000382
Time:                        23:28:09   Log-Likelihood:                -185.18
No. Observations:                  51   AIC:                             382.4
Df Residuals:                      45   BIC:                             393.9
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                      47.4779     70.790      0.671      0.506     -95.100     190.056
median_household_income     0.0003      0.000      1.169      0.248      -0.000       0.001
pct_bachelors_plus          0.0595      0.422      0.141      0.889      -0.791       0.910
pct_with_internet          -0.4460      0.956     -0.466      0.643      -2.372       1.480
pct_knowledge               0.9469      0.403      2.348      0.023       0.135       1.759
blue_state                 -0.8240      3.904     -0.211      0.834      -8.688       7.040
==============================================================================
Omnibus:                        0.980   Durbin-Watson:                   0.784
Prob(Omnibus):                  0.613   Jarque-Bera (JB):                1.016
Skew:                           0.301   Prob(JB):                        0.602
Kurtosis:                       2.660   Cond. No.                     3.94e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.94e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

================================================================================
COEFFICIENTS TABLE (COMPLETE)
================================================================================

               Variable  Coefficient  Std Error   t-value  P-value Significant  Lower CI (95%)  Upper CI (95%)
              Intercept    47.477917  70.789798  0.670689 0.505847          No      -95.100056      190.055889
Median Household Income     0.000350   0.000299  1.169385 0.248406          No       -0.000253        0.000953
          % Bachelor's+     0.059543   0.422372  0.140973 0.888520          No       -0.791158        0.910245
        % With Internet    -0.445992   0.956123 -0.466459 0.643138          No       -2.371723        1.479739
  % Knowledge Economy *     0.946938   0.403272  2.348135 0.023323         Yes        0.134706        1.759170
    Blue State (vs Red)    -0.824011   3.904356 -0.211049 0.833802          No       -8.687788        7.039766

================================================================================

================================================================================
MODEL INTERPRETATION
================================================================================

Model Fit:
  • R-squared: 0.3870 (38.7%)
  • Adjusted R-squared: 0.3189 (31.9%)
  • F-statistic: 5.6825
  • F-statistic p-value: 0.000382
  ✓ Model is statistically significant overall

--------------------------------------------------------------------------------
COEFFICIENT INTERPRETATION:
--------------------------------------------------------------------------------

Intercept:
  Coefficient: 47.477917
  P-value: 0.505847 

Median Household Income:
  Coefficient: 0.000350
  P-value: 0.248406 
  Interpretation: ✗ Not statistically significant (p = 0.2484)

% Bachelor's+:
  Coefficient: 0.059543
  P-value: 0.888520 
  Interpretation: ✗ Not statistically significant (p = 0.8885)

% With Internet:
  Coefficient: -0.445992
  P-value: 0.643138 
  Interpretation: ✗ Not statistically significant (p = 0.6431)

% Knowledge Economy:
  Coefficient: 0.946938
  P-value: 0.023323 *
  Interpretation: ✓ Statistically significant - positively associated with ChatGPT search intensity

Blue State (vs Red):
  Coefficient: -0.824011
  P-value: 0.833802 
  Interpretation: ✗ Not statistically significant (p = 0.8338)

================================================================================
SUMMARY: SIGNIFICANT PREDICTORS
================================================================================

✓ 1 variable(s) are statistically significant:
  • % Knowledge Economy: p = 0.0233 *

✗ 4 variable(s) are NOT statistically significant:
  • Median Household Income: p = 0.2484
  • % Bachelor's+: p = 0.8885
  • % With Internet: p = 0.6431
  • Blue State (vs Red): p = 0.8338

--------------------------------------------------------------------------------
NOTE ON MULTICOLLINEARITY:
--------------------------------------------------------------------------------
When variables are highly correlated (multicollinearity), individual predictors
may not be significant even though the model overall is significant.
This is because the variables share predictive power, making it difficult to
separate their individual effects. Check VIF values to assess multicollinearity.
================================================================================
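One way to act on the recommendation above is to regress search intensity on each predictor separately, so correlated predictors no longer compete for the same variance. A minimal sketch of that per-predictor loop on synthetic data (numpy only; in the real analysis the loop would run over the predictor columns of merged_df):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 51                                # same number of observations as states + DC
x1 = rng.normal(50, 10, n)            # stand-in for a predictor like pct_knowledge
x2 = x1 + rng.normal(0, 2, n)         # a second predictor highly correlated with x1
y = 30 + 1.0 * x1 + rng.normal(0, 5, n)

def simple_ols(x, y):
    """Bivariate OLS: return (slope, intercept, R^2)."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r ** 2

# Each predictor gets its own regression, sidestepping multicollinearity
for name, x in [("x1", x1), ("x2", x2)]:
    slope, intercept, r2 = simple_ols(x, y)
    print(f"{name}: slope={slope:.3f}, R^2={r2:.3f}")
```

Note that both correlated predictors will look individually predictive here; separate regressions dodge unstable coefficients but cannot tell you which variable carries the signal.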
Code
# Multicollinearity Check: Variance Inflation Factor (VIF)
# VIF > 10 indicates potential multicollinearity issues

from statsmodels.stats.outliers_influence import variance_inflation_factor

print("=" * 80)
print("MULTICOLLINEARITY DIAGNOSTICS (VIF)")
print("=" * 80)

# Use X from the full model (should be defined from previous cell)
# If not, prepare it
if 'X' not in locals():
    if 'regression_data' not in locals() or (isinstance(regression_data, pd.DataFrame) and regression_data.empty):
        regression_data = merged_df[['Avg_AI_Interest', 'median_household_income', 
                                      'pct_bachelors_plus', 'pct_with_internet', 'pct_knowledge', 'blue_state']].dropna()
    X = regression_data[['median_household_income', 'pct_bachelors_plus', 
                         'pct_with_internet', 'pct_knowledge', 'blue_state']]

# Calculate VIF for each predictor variable
# Note: X is passed without a constant here. For uncentered predictors this
# yields "uncentered" VIFs, which can run much higher than VIFs computed with
# an intercept included in each auxiliary regression.
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print("\nVariance Inflation Factors:")
print(vif_data.to_string(index=False))

print("\n" + "-" * 80)
print("Interpretation:")
print("-" * 80)
print("- VIF < 5: No multicollinearity concern")
print("- VIF 5-10: Moderate multicollinearity")
print("- VIF > 10: High multicollinearity (may need to address)")

high_vif = vif_data[vif_data['VIF'] > 10]
if len(high_vif) > 0:
    print(f"\n⚠ Warning: {len(high_vif)} variable(s) with VIF > 10:")
    print(high_vif.to_string(index=False))
    print("\n⚠⚠⚠ MULTICOLLINEARITY DETECTED ⚠⚠⚠")
    print("High VIF values indicate that the predictors are highly correlated with each other,")
    print("making it difficult to separate their individual effects.")
    print("\nThis explains why:")
    print("  • The model is significant overall but individual predictors may not be")
    print("  • The condition number is very high")
    print("  • Coefficient estimates may be unstable")
    print("\nRecommendation: Consider using individual regressions or removing highly correlated variables")
else:
    print("\n✓ No severe multicollinearity detected (all VIF < 10)")

print("=" * 80)
================================================================================
MULTICOLLINEARITY DIAGNOSTICS (VIF)
================================================================================

Variance Inflation Factors:
               Variable        VIF
median_household_income 163.747450
     pct_bachelors_plus 108.726848
      pct_with_internet 147.544144
          pct_knowledge 120.113120
             blue_state   4.184271

--------------------------------------------------------------------------------
Interpretation:
--------------------------------------------------------------------------------
- VIF < 5: No multicollinearity concern
- VIF 5-10: Moderate multicollinearity
- VIF > 10: High multicollinearity (may need to address)

⚠ Warning: 4 variable(s) with VIF > 10:
               Variable        VIF
median_household_income 163.747450
     pct_bachelors_plus 108.726848
      pct_with_internet 147.544144
          pct_knowledge 120.113120

⚠⚠⚠ MULTICOLLINEARITY DETECTED ⚠⚠⚠
High VIF values indicate that the predictors are highly correlated with each other,
making it difficult to separate their individual effects.

This explains why:
  • The model is significant overall but individual predictors may not be
  • The condition number is very high
  • Coefficient estimates may be unstable

Recommendation: Consider using individual regressions or removing highly correlated variables
================================================================================
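The VIFs above are computed on a design matrix with no intercept, which for uncentered predictors (income in dollars, percentages far from zero) can overstate collinearity. A numpy-only sketch of the conventional "centered" VIF, where each auxiliary regression includes an intercept, shown on synthetic data (function name and data are illustrative):

```python
import numpy as np

def centered_vif(X):
    """VIF per column, with an intercept included in each auxiliary regression.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on all other columns plus a constant.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept included
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 200)
b = a + rng.normal(0, 0.1, 200)   # nearly collinear with a
c = rng.normal(0, 1, 200)         # independent
X = np.column_stack([a, b, c])
print([round(v, 1) for v in centered_vif(X)])
```

On the real data this could be applied as `centered_vif(X.values)`; if the centered VIFs remain large, the collinearity is genuine rather than an artifact of the missing intercept.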
Code
# Regression with Knowledge and Blue State Only
# Simple two-predictor model: pct_knowledge and blue_state

print("=" * 80)
print("REGRESSION: KNOWLEDGE + BLUE STATE ONLY")
print("=" * 80)

# Prepare data - remove rows with any missing values for these specific variables
regression_data_simple = merged_df[['Avg_AI_Interest', 'pct_knowledge', 'blue_state']].dropna()

print(f"\nObservations: {len(regression_data_simple)}")
print(f"\nVariables included in the model:")
print(f"  1. % Knowledge Economy Employment (pct_knowledge)")
print(f"  2. Blue State (vs Red State) - Dummy variable (blue_state)")
print(f"\nDependent Variable: ChatGPT Search Intensity (Avg_AI_Interest)\n")

# Prepare X and y
X_simple = regression_data_simple[['pct_knowledge', 'blue_state']]
y_simple = regression_data_simple['Avg_AI_Interest']

# Add constant term for intercept
X_simple_const = sm.add_constant(X_simple)

# Fit the model
model_simple = sm.OLS(y_simple, X_simple_const).fit()

print(f"✓ Model fitted successfully!")
print(f"  Number of predictors: {len(X_simple.columns)}")
print(f"  Predictor names: {list(X_simple.columns)}")

# Display model summary
print("\n" + "=" * 80)
print("MODEL SUMMARY")
print("=" * 80)
print("\nRegression results with Knowledge and Blue State:\n")
print(model_simple.summary())
print("\n" + "=" * 80)

# Create coefficients table
var_names = ['Intercept', '% Knowledge Economy', 'Blue State (vs Red)']
significance_markers = []
for pval in model_simple.pvalues:
    if pval < 0.001:
        significance_markers.append('***')
    elif pval < 0.01:
        significance_markers.append('**')
    elif pval < 0.05:
        significance_markers.append('*')
    else:
        significance_markers.append('')

var_names_with_sig = [f"{var} {sig}" if sig else var for var, sig in zip(var_names, significance_markers)]

coef_table_simple = pd.DataFrame({
    'Variable': var_names_with_sig,
    'Coefficient': model_simple.params.values,
    'Std Error': model_simple.bse.values,
    't-value': model_simple.tvalues.values,
    'P-value': model_simple.pvalues.values,
    'Significant': ['Yes' if p < 0.05 else 'No' for p in model_simple.pvalues],
    'Lower CI (95%)': model_simple.conf_int()[0].values,
    'Upper CI (95%)': model_simple.conf_int()[1].values
})

print("\n" + "=" * 80)
print("COEFFICIENTS TABLE")
print("=" * 80)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print("\n" + coef_table_simple.to_string(index=False))
print("\n" + "=" * 80)

# Model interpretation
print(f"\n" + "=" * 80)
print("MODEL INTERPRETATION")
print("=" * 80)
print(f"\nModel Fit:")
print(f"  • R-squared: {model_simple.rsquared:.4f} ({model_simple.rsquared:.1%})")
print(f"  • Adjusted R-squared: {model_simple.rsquared_adj:.4f} ({model_simple.rsquared_adj:.1%})")
print(f"  • F-statistic: {model_simple.fvalue:.4f}")
print(f"  • F-statistic p-value: {model_simple.f_pvalue:.6f}")
if model_simple.f_pvalue < 0.05:
    print("  ✓ Model is statistically significant overall")
else:
    print("  ✗ Model is not statistically significant overall")

print(f"\n" + "-" * 80)
print("COEFFICIENT INTERPRETATION:")
print("-" * 80)

# Interpret each coefficient
# Reuse var_names defined above for the coefficient labels
for i, (var, coef, pval) in enumerate(zip(var_names, model_simple.params, model_simple.pvalues)):
    significance = "***" if pval < 0.001 else "**" if pval < 0.01 else "*" if pval < 0.05 else ""
    print(f"\n{var}:")
    print(f"  Coefficient: {coef:.6f}")
    print(f"  P-value: {pval:.6f} {significance}")
    if i == 1:  # Knowledge Economy
        if pval < 0.05:
            direction = "higher" if coef > 0 else "lower"
            print(f"  Interpretation: ✓ Statistically significant - a 1 percentage point increase in knowledge economy employment is associated with {direction} ChatGPT search intensity ({abs(coef):.2f} points)")
        else:
            print(f"  Interpretation: ✗ Not statistically significant (p = {pval:.4f})")
    elif i == 2:  # Blue State
        if pval < 0.05:
            direction = "higher" if coef > 0 else "lower"
            print(f"  Interpretation: ✓ Statistically significant - Blue states have {direction} ChatGPT search intensity than red states (by {abs(coef):.2f} points)")
        else:
            print(f"  Interpretation: ✗ Not statistically significant (p = {pval:.4f})")

# Check VIF for this simpler model
print("\n" + "=" * 80)
print("MULTICOLLINEARITY CHECK (VIF)")
print("=" * 80)
vif_simple = pd.DataFrame()
vif_simple["Variable"] = X_simple.columns
vif_simple["VIF"] = [variance_inflation_factor(X_simple.values, i) for i in range(X_simple.shape[1])]
print("\n" + vif_simple.to_string(index=False))
high_vif_simple = vif_simple[vif_simple['VIF'] > 10]
if len(high_vif_simple) > 0:
    print(f"\n⚠ Warning: {len(high_vif_simple)} variable(s) with VIF > 10")
else:
    print("\n✓ No multicollinearity concerns (all VIF < 10)")

print("=" * 80)
================================================================================
REGRESSION: KNOWLEDGE + BLUE STATE ONLY
================================================================================

Observations: 51

Variables included in the model:
  1. % Knowledge Economy Employment (pct_knowledge)
  2. Blue State (vs Red State) - Dummy variable (blue_state)

Dependent Variable: ChatGPT Search Intensity (Avg_AI_Interest)

✓ Model fitted successfully!
  Number of predictors: 2
  Predictor names: ['pct_knowledge', 'blue_state']

================================================================================
MODEL SUMMARY
================================================================================

Regression results with Knowledge and Blue State:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Avg_AI_Interest   R-squared:                       0.329
Model:                            OLS   Adj. R-squared:                  0.301
Method:                 Least Squares   F-statistic:                     11.76
Date:                Sat, 06 Dec 2025   Prob (F-statistic):           6.97e-05
Time:                        23:28:09   Log-Likelihood:                -187.49
No. Observations:                  51   AIC:                             381.0
Df Residuals:                      48   BIC:                             386.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            27.4584     11.926      2.302      0.026       3.480      51.437
pct_knowledge     1.1454      0.352      3.250      0.002       0.437       1.854
blue_state        2.4993      3.585      0.697      0.489      -4.709       9.707
==============================================================================
Omnibus:                        0.295   Durbin-Watson:                   0.667
Prob(Omnibus):                  0.863   Jarque-Bera (JB):                0.443
Skew:                           0.152   Prob(JB):                        0.801
Kurtosis:                       2.660   Cond. No.                         325.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

================================================================================

================================================================================
COEFFICIENTS TABLE
================================================================================

              Variable  Coefficient  Std Error  t-value  P-value Significant  Lower CI (95%)  Upper CI (95%)
           Intercept *    27.458393  11.925721 2.302451 0.025689         Yes        3.480123       51.436663
% Knowledge Economy **     1.145363   0.352459 3.249632 0.002114         Yes        0.436696        1.854031
   Blue State (vs Red)     2.499283   3.584887 0.697172 0.489058          No       -4.708616        9.707182

================================================================================

================================================================================
MODEL INTERPRETATION
================================================================================

Model Fit:
  • R-squared: 0.3289 (32.9%)
  • Adjusted R-squared: 0.3009 (30.1%)
  • F-statistic: 11.7613
  • F-statistic p-value: 0.000070
  ✓ Model is statistically significant overall

--------------------------------------------------------------------------------
COEFFICIENT INTERPRETATION:
--------------------------------------------------------------------------------

Intercept:
  Coefficient: 27.458393
  P-value: 0.025689 *

% Knowledge Economy:
  Coefficient: 1.145363
  P-value: 0.002114 **
  Interpretation: ✓ Statistically significant - a 1 percentage point increase in knowledge economy employment is associated with 1.15 points higher ChatGPT search intensity

Blue State (vs Red):
  Coefficient: 2.499283
  P-value: 0.489058 
  Interpretation: ✗ Not statistically significant (p = 0.4891)

================================================================================
MULTICOLLINEARITY CHECK (VIF)
================================================================================

     Variable      VIF
pct_knowledge 2.443524
   blue_state 2.443524

✓ No multicollinearity concerns (all VIF < 10)
================================================================================